PHASE 1: ASK

Summary of the business task

Key tasks:

1. Identify the business task

2. Consider key stakeholders

PHASE 2: PREPARE

Description of all data sources used

Key tasks:

1. Download data and store it appropriately.

File ID and Date Key content
dailyActivity_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Steps, Distance, Intensities, Active minutes, Calories
dailyCalories_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Calories (duplicated info as in dailyActivity_merged.csv)
dailyIntensities_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Intensities (duplicated info as in dailyActivity_merged.csv)
dailySteps_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Steps (duplicated info as in dailyActivity_merged.csv)
heartrate_seconds_merged.csv 7 users over 2 days;
2 users provide limited data
Heart rate (second)
hourlyCalories_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Calories (hourly)
hourlyIntensities_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Intensities (hourly; total and avg)
hourlySteps_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Steps (hourly; total)
minuteCaloriesNarrow_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Calories (minute)
minuteCaloriesWide_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Calories (minute; wide of minuteCaloriesNarrow_merged.csv)
minuteIntensitiesNarrow_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Intensity (minute)
minuteIntensitiesWide_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Intensity (minute; wide of minuteIntensitiesNarrow_merged.csv)
minuteMETsNarrow_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
METs = metabolic equivalents (minute)
minuteSleep_merged.csv 24 users over 31 days;
11 users provide data less than 25 days
Sleep stage (minute)
minuteStepsNarrow_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Intensity (minute)
minuteStepsWide_merged.csv 33 users over 31 days;
4 users provide data less than 25 days*
Intensity (minute; wide of minuteStepsNarrow_merged.csv)
sleepDay_merged.csv 24 users over 31 days;
14 users provide data less than 25 days
Sleep stage (minute)
weightLogInfo_merged.csv 8 users over 31 days;
7 users provide data less than 25 days
Weight, Body fat, BMI

2. Identify how it’s organized.

3. Sort and filter the data.

4. Determine the credibility of the data.

PHASE 3: PROCESS

Documentation of any cleaning or manipulation of data

Key tasks:

1. Check the data for errors.

2. Choose your tools.

3. Transform the data so you can work with it effectively.

4. Document the cleaning process.

  1. Upload data files under a new project, “Bellabeat”, in BigQuery cloud space. The date and time-related variables cannot be auto-detected, so I import them as STRING first.

  2. Check for primary and foreign keys to build the relationships among tables. Common identifiers: “Id”, and time columns.

  3. Standardize time columns using REGEXP.

  4. Create a master file at the “day” level by merging dailyActivity_merged.csv, dailyCalories_merged.csv, dailyIntensities_merged.csv, dailySteps_merged.csv, and sleepDay_merged.csv based on Id and time.

  5. Create a master file at the “hour” level by merging hourlyCalories_merged.csv, hourlyIntensities_merged.csv, minuteSleep_merged.csv, and hourlySteps_merged.csv based on Id and time.

  6. Create “day”, “day of week”, and “weekend” columns as identifiers for later use.

  7. Export the clean data from BigQuery to the local folder for later use. The data files for this project are small enough for download.

PHASE 4: ANALYZE

Summary of the analysis

Key tasks:

1. Aggregate your data so it’s useful and accessible.

2. Organize and format your data.

3. Perform calculations.

#install.packages('tidyverse')
library(tidyverse)
library(dplyr)

DayActivity <- read_csv("DayActivity_master.csv")
HourActivity <- read_csv("HourActivity_master.csv")

4. Identify trends and relationships.

# install.packages('heatmaply')
library(heatmaply)

# generate total minutes tracked by fitbit
DayActivity_1 <- DayActivity %>% 
  mutate(UseMin = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes + SedentaryMinutes) %>% 
  distinct()

# change long format to wide, and creat matrix format
mat <- DayActivity_1 %>% 
  select(Id, ActivityDate, UseMin) %>% 
  arrange(ActivityDate) %>% 
  pivot_wider(names_from = ActivityDate, values_from = UseMin) %>% 
  select(-"Id") %>%  
  as.matrix()

# heatmap
heatmaply(
  mat, 
  dendrogram = "none",
  xlab = "Dates", ylab = "User", 
  main = "Daily Usage of Fitbit (Interactive)",
  grid_color = "white",
  grid_width = 0.00001,
  titleX = FALSE,
  #hide_colorbar = TRUE,
  label_names = c("User", "Date:", "Tracked Minutes"),
  fontsize_row = 6, fontsize_col = 6,
  labCol = colnames(mat),
  labRow = rownames(mat),
  scale_fill_gradient_fun = ggplot2::scale_fill_gradient2(
    low = "white", 
    high = "brown1", 
    midpoint = 600, 
    limits = c(0, 1440)),
  heatmap_layers = theme(axis.line=element_blank())
)
# create group, number of days used, and minutes used per hour
User <- DayActivity_1 %>% 
  distinct() %>% 
  group_by(Id) %>% 
  summarise(num_dayuse = n(), avg_UseMin = round(mean(UseMin)/24)) %>% 
  mutate(use_group = case_when(
    between(num_dayuse, 0,5) ~ "0-5 days, low usage",
    between(num_dayuse, 6,24) ~ "6-24 days, mid usage",
    between(num_dayuse, 25,31) ~ "25-31 days, high usage"
  ))

User_bar <- User %>% 
  group_by(use_group) %>% 
  summarise(num_user = n())

# Compute the position of labels
User_bar <- User_bar %>% 
  arrange(desc(use_group)) %>%
  mutate(prop = num_user / sum(User_bar$num_user) *100) %>%
  mutate(ypos = cumsum(prop)- 0.5*prop )

# Basic piechart
ggplot(User_bar, aes(x="", y=prop, fill=use_group)) +
  geom_bar(stat="identity", width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() + 
  labs(fill = "User Group", title = "Device Usage of the 33 Users") +
  geom_text(aes(y = ypos, label = num_user), color = "white", size=6)

5. Data integrity checks again.

# Compare hourly data to daily data
# install.packages("formattable")
library(formattable)

User_hour <- HourActivity %>% 
  distinct() %>% 
  group_by(Id) %>% 
  summarise(num_hour = n(), num_dayuse_hour = length(unique(ActivityDate))) %>% 
  mutate(avg_usemin_hour = (num_hour*60)/(num_dayuse_hour*24)) %>% 
  left_join(User, by = "Id") %>% 
  mutate(day_diff = num_dayuse - num_dayuse_hour)

User_tbl <- User_hour %>% 
  arrange(desc(day_diff)) %>% 
  select("Id", "num_dayuse_hour", "num_dayuse", "day_diff") %>% 
  filter(day_diff != 0)

colnames(User_tbl) <- c("User ID", "Used Day from Daily Data", "Used Day from Hourly Data", "Difference (Day)")

formattable(User_tbl, list("Difference (Day)" = color_bar("pink")))
User ID Used Day from Daily Data Used Day from Hourly Data Difference (Day)
1503960366 30 31 1
3977333714 29 30 1
6290855005 28 29 1
8253242879 18 19 1
8583815059 30 31 1
8792009665 28 29 1

PHASE 5: SHARE

Supporting visualizations and key findings

Key tasks:

  1. Determine the best way to share your findings.

  2. Create effective data visualizations.

  3. Present your findings.

  4. Ensure your work is accessible.

Do people use the device more on certain days of the week?

# aggregate data by day of the week (Usage)
bar_data <- DayActivity_1 %>% 
  left_join(User, by = "Id") %>% 
  group_by(use_group, dow, dow_number) %>% 
  summarise(avg_UseMin = mean(UseMin)) %>% 
  arrange(dow_number)

# stacked bar chart (Usage)
ggplot(bar_data, 
       aes(fill=use_group, y=avg_UseMin, x=dow_number)) + 
    facet_wrap(~use_group) +
  geom_bar(position=position_dodge(preserve = "single"), stat="identity", color="white") +
  labs(title = "Usage by Day of the Week", fill = "User Group",
       y = "Average Tracked Minutes", x="") +
  scale_x_continuous(breaks=0:6,
                     labels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "grey")) +
  theme(legend.position="bottom")

So if people who wear the device for more days wear it for a shorter period each day, what motivates them to keep wearing the device?

# aggregate data by day of the week (Calories)
bar_data_calorie <- DayActivity_1 %>% 
  left_join(User, by = "Id") %>% 
  group_by(use_group, dow, dow_number) %>% 
  summarise(avg_calorie = mean(Calories)) %>% 
  arrange(dow_number)

# stacked bar chart (Calories)
ggplot(bar_data_calorie, 
       aes(fill=use_group, y=avg_calorie, x=dow_number)) + 
    facet_wrap(~use_group) +
  geom_bar(position=position_dodge(preserve = "single"), stat="identity", color="white") +
  labs(title = "Calories by Day of the Week", fill = "User Group",
       y = "Average Calorie Consumption", x="") +
  scale_x_continuous(breaks=0:6,
                     labels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(), axis.line = element_line(colour = "grey")) +
  theme(legend.position="bottom")

So if people who wear the device for more days consume more calories generally, what time are they most active during a day?

# Calculate average calories tracked per hour 
bar_hour <- HourActivity %>% 
  left_join(User, by = "Id") %>% 
  mutate(ActHour = format(as.POSIXct(ActHour), format = "%H:%M")) %>% 
  group_by(ActHour, use_group) %>% 
  summarise(avg_calories = mean(Calories), .groups = "drop") %>% 
  arrange(ActHour, use_group) 

ggplot(bar_hour,
       aes(x = ActHour, y = avg_calories, group = use_group, color = use_group, fill = use_group)) +
  facet_wrap(~use_group) +
  #geom_line() +
  geom_area(alpha=0.2) +
  labs(title = "Calorie Consumption by Group by Hour", fill = "User Group",
       y = "Average Calorie Consumption", x="") + 
  #theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        #panel.background = element_blank(), axis.line = element_line(colour = "grey")),
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=0.5, size=6)) +
  theme(legend.position="none")

PHASE 6: ACT

Top high-level content recommendations based on analysis

Key tasks:

  1. Create your portfolio.

  2. Add your case study.

  3. Practice presenting your case study to a friend or family member.

Key takeaways:

1. How are people using their smart devices?

2. How could these trends apply to Bellabeat customers?

3. How could these trends inform Bellabeat marketing strategy?

Limitations:

Next steps: